Airbnb is among the most frequently used platforms for booking short-term rentals worldwide. In this analysis, we put ourselves in the shoes of a tech-savvy couple that is planning a trip to Berlin and wants to book an apartment via Airbnb. Using city-specific Airbnb data, the goal of the analysis is to find a regression model that predicts the price this couple would have to pay for a 4-night stay at an Airbnb apartment.
There are three main steps to this analysis: (i) data exploration and feature selection, (ii) model selection and validation, and (iii) a brief summary of findings and recommendations.
First, we import the relevant libraries and define some of the basic settings for the analysis.
Next, we load the relevant data from insideairbnb.com. We cache this data so that it does not download every time that the document is knitted.
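One way to avoid repeated downloads is to keep a local copy and only fetch the file when that copy is missing. The following is a minimal sketch of that idea in base R, not the caching mechanism used in this document. The function name `load_cached_csv`, the injectable `downloader` argument, and the placeholder URL are assumptions made for illustration.

```r
# Download a file only if no local copy exists, then read the local copy.
# `downloader` can be swapped out, so the function can be tried offline.
load_cached_csv <- function(url, path, downloader = utils::download.file) {
  if (!file.exists(path)) {
    downloader(url, path, mode = "wb")
  }
  utils::read.csv(path, stringsAsFactors = FALSE)
}

# Demonstration: a small local file stands in for an already cached download,
# so the placeholder URL below is never contacted.
demo_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = c(2015, 3176), price = c("$77.00", "$90.00")),
  demo_path,
  row.names = FALSE
)
listings_demo <- load_cached_csv("https://example.invalid/listings.csv", demo_path)
listings_demo
```

When knitting, the same effect can often be achieved with chunk-level caching; the sketch above only makes the logic explicit.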
Now that the data is loaded, it helps to get a feel for the different variables. This part of the analysis is known as Exploratory Data Analysis. There are three substeps to this: looking at the raw values, computing summary statistics, and creating visualisations.
This tells us that we are looking at more than 18k Airbnb rentals in Berlin, for which we have 74 variables. “Glimpse” also tells us that the variables are in all kinds of formats and likely require some manipulation for the actual analysis. For instance, “host_acceptance_rate” is stored as a character string (e.g., “91%”) rather than as a number.
Rows: 18,288
Columns: 74
$ id <dbl> 2015, 3176, 7071, 9991, 1~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped <date> 2021-09-22, 2021-09-22, ~
$ name <chr> "Berlin-Mitte Value! Quie~
$ description <chr> "Great location! <br />3~
$ neighborhood_overview <chr> "It is located in the for~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 2217, 3718, 17391, 33852,~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Ion", "Britta", "BrightR~
$ host_since <date> 2008-08-18, 2008-10-19, ~
$ host_location <chr> "Key Biscayne, Florida, U~
$ host_about <chr> "Isn’t sharing economy gr~
$ host_response_time <chr> "within an hour", "a few ~
$ host_response_rate <chr> "100%", "40%", "100%", "N~
$ host_acceptance_rate <chr> "91%", "100%", "N/A", "0%~
$ host_is_superhost <lgl> TRUE, FALSE, TRUE, FALSE,~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Mitte", "Prenzlauer Berg~
$ host_listings_count <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_total_listings_count <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood <chr> "Berlin, Germany", "Berli~
$ neighbourhood_cleansed <chr> "Brunnenstr. Süd", "Prenz~
$ neighbourhood_group_cleansed <chr> "Mitte", "Pankow", "Panko~
$ latitude <dbl> 52.53305, 52.53471, 52.54~
$ longitude <dbl> 13.40394, 13.41810, 13.41~
$ property_type <chr> "Entire guesthouse", "Ent~
$ room_type <chr> "Entire home/apt", "Entir~
$ accommodates <dbl> 2, 4, 2, 7, 1, 5, 2, 4, 4~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "1 bath", "1 bath", "1 sh~
$ bedrooms <dbl> 1, 1, 1, 4, NA, 1, NA, 2,~
$ beds <dbl> 0, 2, 2, 7, 1, 3, 0, 2, 2~
$ amenities <chr> "[\"Refrigerator\", \"Hea~
$ price <chr> "$77.00", "$90.00", "$33.~
$ minimum_nights <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_minimum_nights <dbl> 33, 62, 1, 6, 90, 60, 5, ~
$ maximum_minimum_nights <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ minimum_maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ maximum_maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_nights_avg_ntm <dbl> 88.2, 62.0, 1.0, 6.0, 90.~
$ maximum_nights_avg_ntm <dbl> 1125.0, 1125.0, 10.0, 14.~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 0, 9, 0, 0, 0, 0, 0, 3, 0~
$ availability_60 <dbl> 21, 9, 0, 0, 1, 0, 4, 31,~
$ availability_90 <dbl> 51, 9, 0, 0, 31, 0, 4, 61~
$ availability_365 <dbl> 326, 93, 0, 0, 102, 144, ~
$ calendar_last_scraped <date> 2021-09-22, 2021-09-22, ~
$ number_of_reviews <dbl> 143, 147, 293, 8, 26, 48,~
$ number_of_reviews_ltm <dbl> 10, 1, 0, 0, 1, 0, 21, 2,~
$ number_of_reviews_l30d <dbl> 1, 0, 0, 0, 0, 0, 3, 0, 0~
$ first_review <date> 2016-04-11, 2010-12-21, ~
$ last_review <date> 2021-07-22, 2017-03-20, ~
$ review_scores_rating <dbl> 4.66, 4.63, 4.83, 5.00, 4~
$ review_scores_accuracy <dbl> 4.79, 4.68, 4.85, 5.00, 5~
$ review_scores_cleanliness <dbl> 4.52, 4.53, 4.90, 5.00, 4~
$ review_scores_checkin <dbl> 4.88, 4.64, 4.86, 5.00, 4~
$ review_scores_communication <dbl> 4.89, 4.69, 4.85, 5.00, 4~
$ review_scores_location <dbl> 4.96, 4.92, 4.91, 4.86, 4~
$ review_scores_value <dbl> 4.59, 4.63, 4.71, 4.86, 4~
$ license <chr> NA, NA, NA, "03/Z/RA/0034~
$ instant_bookable <lgl> FALSE, FALSE, TRUE, FALSE~
$ calculated_host_listings_count <dbl> 5, 1, 1, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_entire_homes <dbl> 5, 1, 0, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 2.15, 1.12, 2.40, 0.16, 0~
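As an illustration of the kind of manipulation the output above calls for, the percentage columns stored as text can be turned into numbers. This is a hedged sketch in base R; the helper name `parse_percent` is hypothetical, and treating "N/A" and empty strings as missing is an assumption about how such values should be handled.

```r
# Convert strings such as "91%" to proportions between 0 and 1.
# "N/A" and empty strings are treated as missing (an assumption).
parse_percent <- function(x) {
  x[x %in% c("N/A", "")] <- NA_character_
  as.numeric(sub("%", "", x, fixed = TRUE)) / 100
}

# First values of host_acceptance_rate as shown in the glimpse output
rates <- c("91%", "100%", "N/A", "0%")
parse_percent(rates)
```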
Using “favstats”, we can get a feel for the values that individual variables take on. We chose “accommodates”, “review_scores_rating”, “number_of_reviews”, and “beds” because our intuitive sense was that these could all impact price in our eventual regression model.
From “favstats”, we learn that the median for accommodates is 2, while the maximum goes up to 16. Also, the average Airbnb rental has a review score of c. 4.6. Finally, there is one Airbnb with 17 beds. These are just some example figures from this descriptive analysis that help us to get a better feel for the data. Also notice that we cannot yet run the command on “price”, since it is still saved as a character variable.
Using “skim”, we can see that there are certain variables where many values are missing (e.g., host_about). It is good to see that “price”, our dependent variable in the regression model, is not missing for any of the rentals.
| variable | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| accommodates | 0 | 2 | 2 | 3 | 16 | 2.714129 | 1.619647 | 18288 | 0 |
| review_scores_rating | 0 | 4.61 | 4.85 | 5 | 5 | 4.626417 | 0.8043395 | 14716 | 3572 |
| number_of_reviews | 0 | 1 | 4 | 17 | 655 | 22.78904 | 51.01942 | 18288 | 0 |
| beds | 0 | 1 | 1 | 2 | 17 | 1.624439 | 1.244291 | 18061 | 227 |
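The same kind of summary can be reproduced without additional packages. The sketch below is a base R approximation of the statistics reported above, not the “favstats” implementation itself. The function name `summary_stats` and the toy vector are assumptions for illustration, and quartiles use R's default quantile definition.

```r
# Five-number summary, mean, sd, count of non-missing values and
# count of missing values for a numeric vector.
summary_stats <- function(x) {
  q <- stats::quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1),
                       na.rm = TRUE, names = FALSE)
  data.frame(
    min = q[1], Q1 = q[2], median = q[3], Q3 = q[4], max = q[5],
    mean = mean(x, na.rm = TRUE),
    sd = stats::sd(x, na.rm = TRUE),
    n = sum(!is.na(x)),
    missing = sum(is.na(x))
  )
}

# Toy values only, not the full "beds" column
beds_demo <- c(0, 2, 2, 7, 1, 3, 0, 2, 2, NA)
stats_demo <- summary_stats(beds_demo)
stats_demo
```

Note that `n` counts non-missing values, which matches the table above, where `n` and `missing` add up to the number of rows.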
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | Date.min | Date.max | Date.median | Date.n_unique | logical.mean | logical.count | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | listing_url | 0 | 1.0000000 | 33 | 37 | 0 | 18288 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | name | 29 | 0.9984143 | 1 | 255 | 0 | 17766 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | description | 544 | 0.9702537 | 1 | 1000 | 0 | 17156 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighborhood_overview | 8702 | 0.5241689 | 1 | 1000 | 0 | 8570 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | picture_url | 0 | 1.0000000 | 60 | 126 | 0 | 18047 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_url | 0 | 1.0000000 | 38 | 43 | 0 | 14776 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_name | 16 | 0.9991251 | 1 | 35 | 0 | 5177 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_location | 59 | 0.9967738 | 1 | 199 | 0 | 952 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_about | 9327 | 0.4899934 | 1 | 5095 | 0 | 6642 | 21 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_time | 16 | 0.9991251 | 3 | 18 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_rate | 16 | 0.9991251 | 2 | 4 | 0 | 66 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_acceptance_rate | 16 | 0.9991251 | 2 | 4 | 0 | 99 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_thumbnail_url | 16 | 0.9991251 | 55 | 106 | 0 | 14674 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_picture_url | 16 | 0.9991251 | 57 | 109 | 0 | 14674 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_neighbourhood | 6091 | 0.6669401 | 1 | 28 | 0 | 165 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_verifications | 0 | 1.0000000 | 2 | 158 | 0 | 318 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood | 8702 | 0.5241689 | 7 | 43 | 0 | 50 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood_cleansed | 0 | 1.0000000 | 4 | 41 | 0 | 137 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood_group_cleansed | 0 | 1.0000000 | 5 | 24 | 0 | 12 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | property_type | 0 | 1.0000000 | 3 | 35 | 0 | 68 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | room_type | 0 | 1.0000000 | 10 | 15 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | bathrooms_text | 26 | 0.9985783 | 6 | 17 | 0 | 27 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | amenities | 0 | 1.0000000 | 2 | 1416 | 0 | 15257 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | price | 0 | 1.0000000 | 5 | 9 | 0 | 430 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | license | 16019 | 0.1240704 | 3 | 342 | 0 | 1921 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_scraped | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2021-09-21 | 2021-10-03 | 2021-09-22 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | host_since | 16 | 0.9991251 | NA | NA | NA | NA | NA | 2008-08-08 | 2021-09-20 | 2015-09-16 | 3562 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | calendar_last_scraped | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2021-09-21 | 2021-10-03 | 2021-09-22 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | first_review | 3572 | 0.8046807 | NA | NA | NA | NA | NA | 2010-12-21 | 2021-09-22 | 2018-07-10 | 2771 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_review | 3572 | 0.8046807 | NA | NA | NA | NA | NA | 2012-07-08 | 2021-09-26 | 2019-09-28 | 2226 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_is_superhost | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.1545534 | FAL: 15448, TRU: 2824 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_has_profile_pic | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.9948555 | TRU: 18178, FAL: 94 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_identity_verified | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.7887478 | TRU: 14412, FAL: 3860 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | bathrooms | 18288 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | calendar_updated | 18288 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | has_availability | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.9803696 | TRU: 17929, FAL: 359 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | instant_bookable | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.3035324 | FAL: 12737, TRU: 5551 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.557156e+07 | 1.540011e+07 | 2.015000e+03 | 1.218794e+07 | 2.385470e+07 | 3.968697e+07 | 5.238006e+07 | ▇▇▇▆▇ |
| numeric | scrape_id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.021092e+13 | 0.000000e+00 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | ▁▁▇▁▁ |
| numeric | host_id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.337946e+07 | 1.083088e+08 | 1.581000e+03 | 1.194556e+07 | 4.352120e+07 | 1.449065e+08 | 4.238179e+08 | ▇▂▁▁▁ |
| numeric | host_listings_count | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.556042e+00 | 4.036450e+01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.010000e+03 | ▇▁▁▁▁ |
| numeric | host_total_listings_count | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.556042e+00 | 4.036450e+01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.010000e+03 | ▇▁▁▁▁ |
| numeric | latitude | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.250997e+01 | 3.244370e-02 | 5.234007e+01 | 5.248953e+01 | 5.250974e+01 | 5.253325e+01 | 5.265611e+01 | ▁▁▇▃▁ |
| numeric | longitude | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.340509e+01 | 6.332170e-02 | 1.309715e+01 | 1.336797e+01 | 1.341485e+01 | 1.343918e+01 | 1.375736e+01 | ▁▂▇▁▁ |
| numeric | accommodates | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.714129e+00 | 1.619647e+00 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.600000e+01 | ▇▂▁▁▁ |
| numeric | bedrooms | 1609 | 0.9120188 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.271779e+00 | 6.272113e-01 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.200000e+01 | ▇▁▁▁▁ |
| numeric | beds | 227 | 0.9875875 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.624439e+00 | 1.244291e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.700000e+01 | ▇▁▁▁▁ |
| numeric | minimum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.324256e+00 | 3.423886e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | ▇▁▁▁▁ |
| numeric | maximum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.883037e+02 | 5.282968e+02 | 1.000000e+00 | 2.800000e+01 | 3.650000e+02 | 1.125000e+03 | 5.000000e+03 | ▇▇▁▁▁ |
| numeric | minimum_minimum_nights | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.220867e+00 | 3.415314e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | ▇▁▁▁▁ |
| numeric | maximum_minimum_nights | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.878657e+00 | 3.535246e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | ▇▁▁▁▁ |
| numeric | minimum_maximum_nights | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.704186e+05 | 3.175798e+07 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| numeric | maximum_maximum_nights | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.878616e+05 | 3.550553e+07 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| numeric | minimum_nights_avg_ntm | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.597167e+00 | 3.451717e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | ▇▁▁▁▁ |
| numeric | maximum_nights_avg_ntm | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.875923e+05 | 3.548948e+07 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| numeric | availability_30 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.806758e+00 | 7.721591e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| numeric | availability_60 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.019412e+01 | 1.775509e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | 6.000000e+01 | ▇▁▁▁▁ |
| numeric | availability_90 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.831753e+01 | 2.927484e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.500000e+01 | 9.000000e+01 | ▇▁▁▁▁ |
| numeric | availability_365 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.556086e+01 | 1.245070e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.620000e+02 | 3.650000e+02 | ▇▁▁▁▂ |
| numeric | number_of_reviews | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.278904e+01 | 5.101942e+01 | 0.000000e+00 | 1.000000e+00 | 4.000000e+00 | 1.700000e+01 | 6.550000e+02 | ▇▁▁▁▁ |
| numeric | number_of_reviews_ltm | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.679899e+00 | 9.356744e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 4.470000e+02 | ▇▁▁▁▁ |
| numeric | number_of_reviews_l30d | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.519904e-01 | 1.654162e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.040000e+02 | ▇▁▁▁▁ |
| numeric | review_scores_rating | 3572 | 0.8046807 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.626417e+00 | 8.043395e-01 | 0.000000e+00 | 4.610000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_accuracy | 3897 | 0.7869094 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.791855e+00 | 4.112054e-01 | 0.000000e+00 | 4.750000e+00 | 4.920000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_cleanliness | 3895 | 0.7870188 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.637258e+00 | 5.258907e-01 | 0.000000e+00 | 4.500000e+00 | 4.800000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_checkin | 3909 | 0.7862533 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.826007e+00 | 3.900019e-01 | 0.000000e+00 | 4.800000e+00 | 4.960000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_communication | 3898 | 0.7868548 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.828607e+00 | 3.978867e-01 | 0.000000e+00 | 4.810000e+00 | 4.970000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_location | 3908 | 0.7863080 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.759599e+00 | 3.838505e-01 | 0.000000e+00 | 4.670000e+00 | 4.880000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_value | 3910 | 0.7861986 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.668290e+00 | 4.501632e-01 | 0.000000e+00 | 4.550000e+00 | 4.760000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | calculated_host_listings_count | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.025153e+00 | 7.454440e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 7.600000e+01 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_entire_homes | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.942257e+00 | 5.416078e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.400000e+01 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_private_rooms | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.859361e-01 | 3.247792e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 4.500000e+01 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_shared_rooms | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.392170e-01 | 2.017200e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.800000e+01 | ▇▁▁▁▁ |
| numeric | reviews_per_month | 3572 | 0.8046807 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.155416e-01 | 1.577983e+00 | 1.000000e-02 | 9.000000e-02 | 3.000000e-01 | 1.000000e+00 | 9.086000e+01 | ▇▁▁▁▁ |
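The `n_missing` and `complete_rate` columns in the table above can be computed directly. The following base R sketch shows the idea on a toy data frame. The function name `missing_summary` and the toy values are assumptions for illustration, and the sketch covers only the missingness part of the “skim” output.

```r
# Missing-value count and completeness rate per column,
# sorted so that the most incomplete columns come first.
missing_summary <- function(df) {
  n_missing <- colSums(is.na(df))
  out <- data.frame(
    variable = names(df),
    n_missing = as.integer(n_missing),
    complete_rate = as.numeric(1 - n_missing / nrow(df)),
    row.names = NULL
  )
  out[order(-out$n_missing), ]
}

# Toy data, not the real listings
toy <- data.frame(
  price = c("$77.00", "$90.00", "$50.00", "$120.00"),
  host_about = c("Hello", NA, NA, NA),
  bedrooms = c(1, 1, NA, 4),
  stringsAsFactors = FALSE
)
miss <- missing_summary(toy)
miss
```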
Next, we transform “price” and some of the other variables into numeric values. We also use “ggpairs” to get a feel for the correlation between some of the variables. For instance, it is interesting to find out whether “accommodates” correlates with “minimum_nights”. Our intuition was that very large Airbnbs may have a higher minimum_nights number, since the cleaning effort for the host is greater.
The output below indicates that this intuition is not confirmed by the data, since there is actually a slightly negative correlation between minimum_nights and accommodates. As one would expect, a higher value of accommodates is correlated with a higher price. The density plots also help us see that, for example, review_scores_rating is left-skewed, with a large number of rentals having very high ratings. Another interesting observation is that maximum_nights has a peak at 365, which means that many rentals cannot be booked for more than a year. This may be due to regulatory reasons that keep hosts from renting out their properties for very long periods of time.
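To make the two steps concrete, the sketch below converts price strings to numbers and then computes pairwise correlations. It uses base R's `cor` to show only the numeric part of what a pairs plot reports. The helper `parse_price` and the toy values are assumptions for illustration, and the toy correlations say nothing about the real data.

```r
# Strip the currency symbol and thousands separators, then convert to numeric.
parse_price <- function(x) as.numeric(gsub("[$,]", "", x))

# Toy data, not the real listings
toy <- data.frame(
  price = c("$77.00", "$90.00", "$1,250.00", "$45.00", "$60.00"),
  accommodates = c(2, 4, 10, 1, 2),
  minimum_nights = c(90, 62, 1, 6, NA),
  stringsAsFactors = FALSE
)
toy$price <- parse_price(toy$price)

# Pairwise Pearson correlations; for each pair of columns,
# rows with a missing value in either column are dropped.
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
round(cor_matrix, 2)
```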
These are some other questions that we can now answer:
How many variables/columns? How many rows/observations?
There are 74 variables and 18,288 observations.
Which variables are numbers?
The following variables are numbers: id, scrape_id, host_id, latitude, longitude, accommodates, bathrooms (entirely missing in this data set, so it is read in as a logical column), bedrooms, beds, price (once converted from text), maximum_nights, minimum_nights, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, reviews_per_month, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms.
Which are categorical or factor variables (numeric or character variables that take a fixed and known set of possible values)?
The following variables are factors: host_response_rate, host_acceptance_rate, host_is_superhost, host_has_profile_pic, host_identity_verified, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, instant_bookable.
What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?
We did not observe a strong correlation between any of the variables we selected for testing. It therefore appears that there is no strong linear relationship among price, accommodates, number_of_reviews, review_scores_rating, maximum_nights, and minimum_nights. We log-transform the price variable at a later stage to reduce the influence of the wide dispersion among very expensive rentals.
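The questions about dimensions and variable types can also be answered programmatically. The sketch below is one base R way to do so on a toy data frame. The function name `column_types` and the toy columns are assumptions for illustration, and deciding which columns to treat as factors remains a judgment call that the storage type alone cannot settle.

```r
# Report the storage class of each column.
column_types <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Toy data, not the real listings
toy <- data.frame(
  accommodates = c(2, 4, 2),
  host_is_superhost = c(TRUE, FALSE, TRUE),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room"),
  stringsAsFactors = FALSE
)

dim(toy)                             # number of rows, number of columns
types <- column_types(toy)
names(types)[types == "numeric"]     # columns stored as numbers
names(types)[types != "numeric"]     # candidates for categorical treatment
```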
We are now at the third step of the Exploratory Data Analysis section.
In this step, we plot some graphs to deepen our understanding of how different variables are distributed. We do not focus exclusively on variables and relationships that may impact price in our regression model, but rather try to get a feel for the data set in general.
In the first chart, we learn that the distribution of beds varies with the number of guests a rental accommodates. This is a rather straightforward relationship, but it helps to start with something that confirms the intuition. In general, the interquartile range increases with the value of accommodates. One can assume that this is due to extra beds in the form of sofa beds, which are likely more frequent in larger rentals. These more “improvised” beds are less likely to be found in smaller rentals.
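The claim about the interquartile range can be checked numerically rather than read off a boxplot. The following base R sketch computes the IQR of beds within each accommodates group. The toy values are made up for illustration and are not taken from the listings.

```r
# Toy data: two group sizes with different spreads in the number of beds
toy <- data.frame(
  accommodates = c(2, 2, 2, 2, 6, 6, 6, 6),
  beds         = c(1, 1, 1, 2, 2, 3, 5, 6)
)

# Interquartile range of beds within each accommodates group
iqr_by_group <- tapply(toy$beds, toy$accommodates, IQR, na.rm = TRUE)
iqr_by_group
```

On the real data, the same call would return one IQR per value of accommodates, which makes it easy to see whether the spread grows with rental size.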
The second chart tells us that superhosts (experienced hosts who meet Airbnb's performance criteria) have a higher median review rating and a smaller interquartile range. One can assume that superhosts more consistently provide a high-quality rental experience and that the spread of ratings is therefore smaller. We can also see that there are certain rentals for which the data set does not provide information on host status (“NA”).
The third chart shows that review ratings among different room types vary. Shared rooms tend to have the worst ratings, which is likely due to the fact that the rental experience is dependent on another visitor.
The fourth chart shows the availability of rentals in different neighbourhoods. For example, in “Mitte” the availability is a lot lower than in Spandau. This is likely due to the fact that Mitte is in a very central location, where the demand for Airbnbs is really high.
For the fifth chart, we filter out all the rentals that have a price >400 to avoid the distorting effect of very expensive rentals. In a later step, we will log-transform the price variable to achieve this. For now, the chart tells us that different room types have different price distributions. The hotel room category, where you also pay for using the amenities of the respective hotel, is unsurprisingly the most expensive one. What is more interesting is that shared rooms and private rooms have very similar distributions. One reason could be that shared rooms are a lot larger, which makes up for the lack of privacy in terms of price.
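The two ways of dealing with very expensive rentals mentioned above, capping and log-transforming, can be sketched side by side. This is an illustration in base R on made-up prices. The cap of 400 follows the text, and the sketch assumes all prices are strictly positive, since the logarithm of zero is not finite.

```r
# Toy data, not the real listings
toy <- data.frame(
  room_type = c("Hotel room", "Private room", "Shared room",
                "Entire home/apt", "Entire home/apt"),
  price     = c(150, 40, 38, 95, 2500),
  stringsAsFactors = FALSE
)

# Option 1: drop listings above the price cap used for the chart
capped <- toy[toy$price <= 400, ]

# Option 2: keep all listings but compress the right tail
toy$log_price <- log(toy$price)

# Median price per room type on the capped data
medians <- aggregate(price ~ room_type, data = capped, FUN = median)
medians
```

Capping discards observations, whereas the log transform keeps them and only changes the scale, which is one reason to prefer it for the regression step.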
From the sixth chart, we learn that whether a host has a profile picture seems to impact the communication rating for a specific rental. Hosts that have a picture tend to score higher in this category. After all, Airbnb customers seem to like to see who their host is and incorporate that into the communication rating they give.
Now, we focus on getting our data set in the right format for our regression analysis.
First, we look at the variable property_type. We can use the count function to determine how many categories there are and how frequent each one is. The four most common property types are entire rental units (~48.0%), private rooms in rental units (~35.7%), entire condominiums (~2.7%), and entire serviced apartments (~2.0%). Together, these property types make up ~88.4% of the whole sample.
| property_type | count | prop_in_percentage |
|---|---|---|
| Entire rental unit | 8778 | 47.9986877 |
| Private room in rental unit | 6534 | 35.7283465 |
| Entire condominium (condo) | 485 | 2.6520122 |
| Entire serviced apartment | 362 | 1.9794401 |
| Entire loft | 327 | 1.7880577 |
| Private room in residential home | 237 | 1.2959318 |
| Private room in condominium (condo) | 219 | 1.1975066 |
| Entire residential home | 183 | 1.0006562 |
| Room in hotel | 175 | 0.9569116 |
| Shared room in rental unit | 117 | 0.6397638 |
| Room in boutique hotel | 96 | 0.5249344 |
| Private room in loft | 80 | 0.4374453 |
| Shared room in hostel | 75 | 0.4101050 |
| Private room in bed and breakfast | 68 | 0.3718285 |
| Entire guesthouse | 56 | 0.3062117 |
| Private room in townhouse | 55 | 0.3007437 |
| Private room in hostel | 48 | 0.2624672 |
| Room in serviced apartment | 47 | 0.2569991 |
| Entire guest suite | 32 | 0.1749781 |
| Room in aparthotel | 31 | 0.1695101 |
| Private room in serviced apartment | 29 | 0.1585739 |
| Entire bungalow | 25 | 0.1367017 |
| Entire townhouse | 24 | 0.1312336 |
| Houseboat | 18 | 0.0984252 |
| Private room | 18 | 0.0984252 |
| Private room in guest suite | 13 | 0.0710849 |
| Private room in pension | 13 | 0.0710849 |
| Entire place | 10 | 0.0546807 |
| Boat | 9 | 0.0492126 |
| Camper/RV | 9 | 0.0492126 |
| Private room in guesthouse | 9 | 0.0492126 |
| Room in hostel | 9 | 0.0492126 |
| Private room in villa | 8 | 0.0437445 |
| Tiny house | 8 | 0.0437445 |
| Entire villa | 7 | 0.0382765 |
| Entire cabin | 6 | 0.0328084 |
| Private room in casa particular | 6 | 0.0328084 |
| Private room in tiny house | 5 | 0.0273403 |
| Shared room in condominium (condo) | 5 | 0.0273403 |
| Entire cottage | 4 | 0.0218723 |
| Private room in boat | 4 | 0.0218723 |
| Private room in bungalow | 4 | 0.0218723 |
| Shared room in boutique hotel | 4 | 0.0218723 |
| Shared room in loft | 3 | 0.0164042 |
| Shared room in residential home | 3 | 0.0164042 |
| Entire home/apt | 2 | 0.0109361 |
| Private room in cottage | 2 | 0.0109361 |
| Room in bed and breakfast | 2 | 0.0109361 |
| Shared room in bed and breakfast | 2 | 0.0109361 |
| Shared room in serviced apartment | 2 | 0.0109361 |
| Shared room in tiny house | 2 | 0.0109361 |
| Treehouse | 2 | 0.0109361 |
| Bus | 1 | 0.0054681 |
| Casa particular | 1 | 0.0054681 |
| Castle | 1 | 0.0054681 |
| Earth house | 1 | 0.0054681 |
| Entire chalet | 1 | 0.0054681 |
| Floor | 1 | 0.0054681 |
| Island | 1 | 0.0054681 |
| Private room in cave | 1 | 0.0054681 |
| Private room in floor | 1 | 0.0054681 |
| Private room in houseboat | 1 | 0.0054681 |
| Private room in tipi | 1 | 0.0054681 |
| Shared room | 1 | 0.0054681 |
| Shared room in boat | 1 | 0.0054681 |
| Shared room in cabin | 1 | 0.0054681 |
| Shared room in townhouse | 1 | 0.0054681 |
| Shipping container | 1 | 0.0054681 |
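A frequency table of this shape can be built in a few lines. The sketch below uses base R's `table` rather than the count function mentioned above, so it runs without extra packages. The toy vector is made up, and the column names simply mirror those of the table above.

```r
# Toy values, not the real property_type column
property_type <- c(rep("Entire rental unit", 5),
                   rep("Private room in rental unit", 3),
                   "Entire loft", "Houseboat")

# Counts per category, most frequent first
counts <- sort(table(property_type), decreasing = TRUE)
freq <- data.frame(
  property_type = names(counts),
  count = as.integer(counts),
  prop_in_percentage = 100 * as.integer(counts) / sum(counts),
  stringsAsFactors = FALSE
)
freq

# Combined share of the two most common types in this toy example
top_two_share <- sum(freq$prop_in_percentage[1:2])
top_two_share
```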
Since the vast majority of the observations fall into one of the top four property types, we create a simplified version of the property_type variable with five categories: the top four categories and Other.
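A minimal sketch of this recoding in base R could look as follows. The function name `simplify_property_type` is hypothetical, the column name `prop_type_simplified` follows the check table, and mapping missing values to Other is an assumption of this sketch rather than something the analysis states.

```r
# The four most common property types, as listed in the frequency table
top_four <- c("Entire rental unit", "Private room in rental unit",
              "Entire condominium (condo)", "Entire serviced apartment")

# Keep the top categories as they are and collapse everything else.
# Note: a missing property type is also mapped to "Other" here.
simplify_property_type <- function(x, keep = top_four) {
  ifelse(x %in% keep, x, "Other")
}

# Toy data, not the real listings
toy <- data.frame(
  property_type = c("Entire rental unit", "Entire loft",
                    "Houseboat", "Entire serviced apartment"),
  stringsAsFactors = FALSE
)
toy$prop_type_simplified <- simplify_property_type(toy$property_type)
toy
```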
We can quickly check if the simplification worked.
| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 8778 |
| Private room in rental unit | Private room in rental unit | 6534 |
| Entire condominium (condo) | Entire condominium (condo) | 485 |
| Entire serviced apartment | Entire serviced apartment | 362 |
| Entire loft | Other | 327 |
| Private room in residential home | Other | 237 |
| Private room in condominium (condo) | Other | 219 |
| Entire residential home | Other | 183 |
| Room in hotel | Other | 175 |
| Shared room in rental unit | Other | 117 |
| Room in boutique hotel | Other | 96 |
| Private room in loft | Other | 80 |
| Shared room in hostel | Other | 75 |
| Private room in bed and breakfast | Other | 68 |
| Entire guesthouse | Other | 56 |
| Private room in townhouse | Other | 55 |
| Private room in hostel | Other | 48 |
| Room in serviced apartment | Other | 47 |
| Entire guest suite | Other | 32 |
| Room in aparthotel | Other | 31 |
| Private room in serviced apartment | Other | 29 |
| Entire bungalow | Other | 25 |
| Entire townhouse | Other | 24 |
| Houseboat | Other | 18 |
| Private room | Other | 18 |
| Private room in guest suite | Other | 13 |
| Private room in pension | Other | 13 |
| Entire place | Other | 10 |
| Boat | Other | 9 |
| Camper/RV | Other | 9 |
| Private room in guesthouse | Other | 9 |
| Room in hostel | Other | 9 |
| Private room in villa | Other | 8 |
| Tiny house | Other | 8 |
| Entire villa | Other | 7 |
| Entire cabin | Other | 6 |
| Private room in casa particular | Other | 6 |
| Private room in tiny house | Other | 5 |
| Shared room in condominium (condo) | Other | 5 |
| Entire cottage | Other | 4 |
| Private room in boat | Other | 4 |
| Private room in bungalow | Other | 4 |
| Shared room in boutique hotel | Other | 4 |
| Shared room in loft | Other | 3 |
| Shared room in residential home | Other | 3 |
| Entire home/apt | Other | 2 |
| Private room in cottage | Other | 2 |
| Room in bed and breakfast | Other | 2 |
| Shared room in bed and breakfast | Other | 2 |
| Shared room in serviced apartment | Other | 2 |
| Shared room in tiny house | Other | 2 |
| Treehouse | Other | 2 |
| Bus | Other | 1 |
| Casa particular | Other | 1 |
| Castle | Other | 1 |
| Earth house | Other | 1 |
| Entire chalet | Other | 1 |
| Floor | Other | 1 |
| Island | Other | 1 |
| Private room in cave | Other | 1 |
| Private room in floor | Other | 1 |
| Private room in houseboat | Other | 1 |
| Private room in tipi | Other | 1 |
| Shared room | Other | 1 |
| Shared room in boat | Other | 1 |
| Shared room in cabin | Other | 1 |
| Shared room in townhouse | Other | 1 |
| Shipping container | Other | 1 |
Next, we look at the minimum_nights variable so that our regression analysis only includes listings that are intended for travel purposes. First, we check the distribution of minimum_nights.
| minimum_nights | count |
|---|---|
| 2 | 4236 |
| 1 | 4194 |
| 3 | 3282 |
| 4 | 1368 |
| 5 | 1293 |
| 7 | 864 |
| 30 | 418 |
| 6 | 382 |
| 14 | 363 |
| 60 | 298 |
| 10 | 284 |
| 90 | 195 |
| 20 | 117 |
| 28 | 95 |
| 15 | 82 |
| 8 | 75 |
| 21 | 72 |
| 180 | 53 |
| 12 | 43 |
| 9 | 39 |
| 25 | 37 |
| 13 | 33 |
| 29 | 32 |
| 61 | 31 |
| 62 | 28 |
| 22 | 26 |
| 120 | 23 |
| 31 | 17 |
| 183 | 16 |
| 45 | 14 |
| 93 | 13 |
| 18 | 11 |
| 150 | 11 |
| 16 | 10 |
| 89 | 10 |
| 91 | 10 |
| 40 | 9 |
| 58 | 9 |
| 100 | 9 |
| 357 | 9 |
| 11 | 7 |
| 19 | 7 |
| 50 | 7 |
| 56 | 7 |
| 365 | 7 |
| 23 | 6 |
| 27 | 6 |
| 181 | 6 |
| 65 | 5 |
| 92 | 5 |
| 200 | 5 |
| 1000 | 5 |
| 17 | 4 |
| 55 | 4 |
| 63 | 4 |
| 70 | 4 |
| 80 | 4 |
| 85 | 4 |
| 300 | 4 |
| 24 | 3 |
| 42 | 3 |
| 59 | 3 |
| 99 | 3 |
| 118 | 3 |
| 182 | 3 |
| 186 | 3 |
| 500 | 3 |
| 1124 | 3 |
| 26 | 2 |
| 33 | 2 |
| 35 | 2 |
| 83 | 2 |
| 84 | 2 |
| 140 | 2 |
| 185 | 2 |
| 240 | 2 |
| 360 | 2 |
| 34 | 1 |
| 37 | 1 |
| 48 | 1 |
| 49 | 1 |
| 51 | 1 |
| 71 | 1 |
| 75 | 1 |
| 82 | 1 |
| 87 | 1 |
| 88 | 1 |
| 98 | 1 |
| 101 | 1 |
| 105 | 1 |
| 119 | 1 |
| 122 | 1 |
| 125 | 1 |
| 128 | 1 |
| 129 | 1 |
| 170 | 1 |
| 179 | 1 |
| 184 | 1 |
| 187 | 1 |
| 188 | 1 |
| 210 | 1 |
| 250 | 1 |
| 270 | 1 |
| 304 | 1 |
| 355 | 1 |
| 356 | 1 |
| 720 | 1 |
| 1100 | 1 |
We can now answer some more questions.
What are the most common values for the variable minimum_nights?
The most common values for minimum_nights are 2, 1, and 3 nights. This makes sense: many people use Airbnb for city trips, so the minimum duration should not be too restrictive. At the same time, the cost and work of cleaning an Airbnb for a one-night booking might not be worth it for many hosts.
Is there any value among the common values that stands out?
The 30-, 14- and 60-night minimums stand out at first glance. These are usually longer-term Airbnbs used by interns or by workers on assembly trips. It is also logical for some landlords to rent out their rooms for longer periods, since the room only has to be tidied once per stay. The highest minimum-night requirement is 1,124 nights. This observation should be investigated further to understand the reason behind such a high value.
What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?
The usual reason for these longer minimum stays is to attract bookings from people who are on work projects or internships, or who need a temporary stay while looking for permanent accommodation. The benefit for the host is that the rooms need to be cleaned and set up less often.
Next, we filter the Airbnb data so that it only includes observations with minimum_nights <= 4.
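The filter could look roughly like this. `listings_toy` is a hypothetical six-row stand-in for the real data, and `listings_short` is an illustrative name for the result.

```r
library(dplyr)

# Hypothetical stand-in for the listings data
listings_toy <- tibble(
  id = 1:6,
  minimum_nights = c(1, 2, 4, 5, 30, 1124)
)

# Keep only listings that can be booked for a stay of four nights or fewer
listings_short <- listings_toy %>%
  filter(minimum_nights <= 4)

nrow(listings_short)
```

In the frequency table above, the values 1 to 4 account for 4,194 + 4,236 + 3,282 + 1,368 = 13,080 listings.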
After making these adjustments, we want to analyze the distribution of rentals in Berlin. As the chart below shows, there are certain quarters with particularly many rentals. For instance, in Kreuzberg (a southern quarter in the city), there are many rentals available. This may be due to the types of buildings and the general infrastructure in the area. Kreuzberg is home to many restaurants and bars, which makes it an interesting area for tourists. Interestingly, there are fewer Airbnbs in the heart of the city. This is likely because the political district as well as many high-end hotels are located here, which leaves less room for Airbnbs.
# Map each rental to its location using the longitude and latitude columns
# (only listings with minimum_nights <= 4 are shown)
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  addCircleMarkers(lng = ~longitude,
                   lat = ~latitude,
                   radius = 0.5,
                   fillColor = "red",
                   fillOpacity = 0.3,
                   popup = ~listing_url,      # link shown when a marker is clicked
                   label = ~property_type)    # text shown on hover
As we get closer to our regression model, we create a new variable called price_4_nights that uses price and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
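One way the target could be computed is sketched below. It rests on two assumptions that the text does not spell out: that `price` is stored as text with a currency symbol and thousands separators, and that the nightly price is first divided by `accommodates` to get a per-person price, then multiplied by two guests and four nights. The rows in `listings_toy` are hypothetical.

```r
# Hypothetical stand-in for the listings data
listings_toy <- data.frame(
  price = c("$60.00", "$1,200.00"),
  accommodates = c(2, 6),
  stringsAsFactors = FALSE
)

# Assumption: strip "$" and "," before converting the price to a number
price_num <- as.numeric(gsub("[$,]", "", listings_toy$price))

# Assumption: per-person nightly price times 2 guests times 4 nights
listings_toy$price_4_nights <- price_num / listings_toy$accommodates * 2 * 4

listings_toy
```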
In the next section, we create a new column called “log(price_4_nights)”. We use log(price_4_nights) because price_4_nights contains some outlier rentals, and the log-transformation reduces their influence. In addition, the log makes the distribution more symmetric, which helps with fitting the regression model. Inference in linear regression assumes approximately normal errors, and a log-transformation brings the data closer to this assumption. It also helps to stabilize the variance, which supports the assumption of constant variance.
We can use histograms to examine the distributions of price_4_nights and log(price_4_nights).
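The effect of the transformation can be illustrated on simulated data. The prices below are drawn from a log-normal distribution purely for illustration (they are not the Airbnb data), and `skewness` is a small helper defined here, not a function from the original analysis.

```r
# Simulated right-skewed prices (hypothetical, not the Airbnb listings)
set.seed(1)
price_4_nights <- rlnorm(1000, meanlog = 5, sdlog = 0.5)

# Sample skewness: close to 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price_4_nights)
skew_log <- skewness(log(price_4_nights))
c(raw = skew_raw, log = skew_log)

# Side-by-side histograms of the raw and the log-transformed variable
op <- par(mfrow = c(1, 2))
hist(price_4_nights, breaks = 40, main = "price_4_nights")
hist(log(price_4_nights), breaks = 40, main = "log(price_4_nights)")
par(op)
```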
We now have all variables in the correct format and can start model selection and validation. We start with a model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
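A fit of this form could be obtained along the following lines. The data frame `toy` is simulated for illustration, so its estimates will not match the output reported below; only the variable names are taken from the text.

```r
# Simulated stand-in data (hypothetical values)
set.seed(7)
n <- 300
toy <- data.frame(
  prop_type_simplified = factor(sample(
    c("Entire condominium (condo)", "Entire rental unit", "Other"),
    n, replace = TRUE)),
  number_of_reviews = rpois(n, 20),
  review_scores_rating = runif(n, 3, 5)
)
toy$price_4_nights <- exp(5 + 0.04 * toy$review_scores_rating +
                            rnorm(n, sd = 0.5))

# Log-linear model; the first factor level serves as the base category
model1_toy <- lm(log(price_4_nights) ~ prop_type_simplified +
                   number_of_reviews + review_scores_rating,
                 data = toy)

summary(model1_toy)$coefficients
```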
Estimate Std. Error t value
(Intercept) 5.2822983 0.0449473 117.522
prop_type_simplifiedEntire rental unit -0.1427983 0.0318703 -4.481
prop_type_simplifiedEntire serviced apartment 0.4198925 0.0464712 9.036
prop_type_simplifiedOther -0.1408665 0.0342009 -4.119
prop_type_simplifiedPrivate room in rental unit -0.5365301 0.0321607 -16.683
number_of_reviews -0.0002594 0.0000804 -3.227
review_scores_rating 0.0426309 0.0068832 6.194
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire rental unit 7.53e-06 ***
prop_type_simplifiedEntire serviced apartment < 2e-16 ***
prop_type_simplifiedOther 3.84e-05 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews 0.00126 **
review_scores_rating 6.12e-10 ***
Residual standard error: 0.4882 on 9866 degrees of freedom
Multiple R-squared: 0.1648, Adjusted R-squared: 0.1643
F-statistic: 324.4 on 6 and 9866 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.011769 4 1.001464
number_of_reviews 1.015923 1 1.007930
review_scores_rating 1.005692 1 1.002842
Because the dependent variable (i.e., price_4_nights) is log-transformed, the interpretation of the coefficients requires one additional step. The coefficient has to be exponentiated to reverse the log-transformation: \((e^{0.0426309}-1) \times 100 \approx 4.3553\). This adjusted coefficient means that for every one-unit increase in review_scores_rating, price_4_nights increases by about 4.4%, holding the other variables constant. This makes intuitive sense: the higher the rating, the more the host can charge. The t-value of more than 6 indicates that this relationship is statistically significant.
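The step can be written out as follows. The symbols \(\beta\) (coefficient of review_scores_rating) and \(x\) (the rating) are introduced here for illustration; all other predictors are held fixed.

```latex
\[
\log Y = \beta_0 + \beta x + \dots
\quad\Longrightarrow\quad
\frac{Y(x+1)}{Y(x)} = e^{\beta}
\]
\[
\%\Delta Y = \left(e^{\beta} - 1\right) \times 100
           = \left(e^{0.0426309} - 1\right) \times 100 \approx 4.36
\]
```

For small coefficients, \(e^{\beta} - 1 \approx \beta\), which is why the transformed value (4.36%) is close to the raw coefficient (0.0426).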
To interpret the coefficients, they have to be transformed like in the previous section. This leads to the following values:
prop_type_simplifiedEntire rental unit: -13.3071
prop_type_simplifiedEntire serviced apartment: 52.17979
prop_type_simplifiedOther: -13.1395
prop_type_simplifiedPrivate room in rental unit: -41.5226
The category “Entire condominium (condo)” is taken as the base level. Hence, the coefficients correspond to the %-change in price_4_nights relative to the base case in which the Airbnb is of prop_type “Entire condominium (condo)”. For instance, if you rent an “Entire serviced apartment”, price_4_nights is about 52% higher than it would be for an “Entire condominium (condo)”. The same logic also applies to the other categories, which are all statistically significant as well. It also makes intuitive sense that, for example, “Entire serviced apartments” are significantly more costly, because you pay for amenities such as regular cleaning or even breakfast. In a further analysis, one could split up the “Other” category to find out more about other property types.
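The transformed values can be reproduced from the model1 coefficients. The vector `coefs` below simply copies the estimates from the summary output; the names `coefs` and `pct_change` are illustrative.

```r
# Coefficients copied from the model1 summary (log scale)
coefs <- c(
  "Entire rental unit"          = -0.1427983,
  "Entire serviced apartment"   =  0.4198925,
  "Other"                       = -0.1408665,
  "Private room in rental unit" = -0.5365301
)

# Percentage change in price_4_nights relative to the base category
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 2)
```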
Next, we want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
| room_type | count |
|---|---|
| Entire home/apt | 5627 |
| Private room | 4061 |
| Shared room | 93 |
| Hotel room | 92 |
Estimate Std. Error t value
(Intercept) 5.319e+00 4.328e-02 122.882
prop_type_simplifiedEntire rental unit -1.433e-01 3.068e-02 -4.673
prop_type_simplifiedEntire serviced apartment 4.187e-01 4.473e-02 9.361
prop_type_simplifiedOther 3.023e-02 3.728e-02 0.811
prop_type_simplifiedPrivate room in rental unit -2.549e-01 4.317e-02 -5.905
number_of_reviews -2.411e-04 7.746e-05 -3.112
review_scores_rating 3.479e-02 6.631e-03 5.247
room_typeHotel room 6.465e-01 5.384e-02 12.009
room_typePrivate room -2.819e-01 3.015e-02 -9.352
room_typeShared room -1.171e+00 5.360e-02 -21.841
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire rental unit 3.01e-06 ***
prop_type_simplifiedEntire serviced apartment < 2e-16 ***
prop_type_simplifiedOther 0.41740
prop_type_simplifiedPrivate room in rental unit 3.65e-09 ***
number_of_reviews 0.00186 **
review_scores_rating 1.58e-07 ***
room_typeHotel room < 2e-16 ***
room_typePrivate room < 2e-16 ***
room_typeShared room < 2e-16 ***
Residual standard error: 0.4699 on 9863 degrees of freedom
Multiple R-squared: 0.2265, Adjusted R-squared: 0.2258
F-statistic: 320.9 on 9 and 9863 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 11.354524 4 1.354865
number_of_reviews 1.017942 1 1.008931
review_scores_rating 1.007537 1 1.003761
room_type 11.342129 3 1.498934
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.227e+00 3.220e-02 162.336 < 2e-16 ***
number_of_reviews -2.131e-04 7.924e-05 -2.690 0.00716 **
review_scores_rating 3.202e-02 6.792e-03 4.715 2.46e-06 ***
room_typeHotel room 7.806e-01 5.063e-02 15.418 < 2e-16 ***
room_typePrivate room -3.956e-01 9.958e-03 -39.725 < 2e-16 ***
room_typeShared room -1.038e+00 5.038e-02 -20.604 < 2e-16 ***
Residual standard error: 0.4816 on 9867 degrees of freedom
Multiple R-squared: 0.1873, Adjusted R-squared: 0.1869
F-statistic: 454.7 on 5 and 9867 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.014175 1 1.007062
review_scores_rating 1.006402 1 1.003196
room_type 1.010664 3 1.001770
There is some multicollinearity between room_type and property_type, as one would expect: in model2, both variables have a GVIF of about 11.3. Because room_type adds more explanatory power to the model (an R-squared of 0.187 for the model with room_type only, compared to 0.165 for model1), we exclude property_type from the model. All room_type variables are statistically significant and tell us different things about price_4_nights.
We now continue by adding other variables to the model to increase its explanatory power. Currently, our model explains only c. 19% of the variation in price. We therefore include more variables to improve on this. Model3 includes the number of bathrooms, bedrooms, beds, and the size of the rental (accommodates).
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.404e+00 3.533e-02 152.975 < 2e-16 ***
number_of_reviews 8.847e-05 7.944e-05 1.114 0.265
review_scores_rating 2.835e-02 6.977e-03 4.063 4.89e-05 ***
room_typeHotel room 7.528e-01 4.879e-02 15.429 < 2e-16 ***
room_typePrivate room -4.816e-01 1.064e-02 -45.262 < 2e-16 ***
room_typeShared room -8.967e-01 4.964e-02 -18.065 < 2e-16 ***
bedrooms 1.729e-01 1.098e-02 15.737 < 2e-16 ***
beds 9.605e-03 5.892e-03 1.630 0.103
accommodates -1.239e-01 5.405e-03 -22.922 < 2e-16 ***
Residual standard error: 0.4581 on 9058 degrees of freedom
Multiple R-squared: 0.2611, Adjusted R-squared: 0.2605
F-statistic: 400.2 on 8 and 9058 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.032268 1 1.016006
review_scores_rating 1.007340 1 1.003664
room_type 1.254522 3 1.038516
bedrooms 2.279887 1 1.509930
beds 3.014537 1 1.736242
accommodates 3.928598 1 1.982069
Based on this model, we learn that bedrooms and the size of the house are significant predictors of price_4_nights, which can be seen from a high absolute t-statistic. As the nr. of bedrooms increases, the price of the rental also increases. As house size increases, the price per person actually decreases (remember that we divided by “accommodates” when adjusting the price_4_nights variable). This makes sense, since the price is then shared among a greater number of heads. Beds is not a statistically significant predictor of price_4_nights. Interestingly, there is some multicollinearity between bedrooms, beds, and accommodates but not enough to disregard the model.
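For terms with one degree of freedom, the GVIF values above coincide with the ordinary variance inflation factor, \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. The sketch below computes this by hand on simulated, deliberately correlated data; the data and the helper name `vif_manual` are hypothetical, and this is not necessarily how the output above was produced.

```r
# Simulated, correlated size variables (hypothetical data)
set.seed(42)
n <- 500
bedrooms <- rpois(n, 1) + 1
beds <- bedrooms + rbinom(n, 1, 0.4)
accommodates <- 2 * beds + rbinom(n, 2, 0.5)
X <- data.frame(bedrooms, beds, accommodates)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the rest
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

A value of 1 means a predictor is uncorrelated with the others; common rules of thumb only treat values above 5 or 10 as problematic, which is consistent with keeping all three variables here.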
Comparing Model3 to Model2, the adjusted R-squared increases to 0.26, which means that we can now explain more than a quarter of the variation in price. In Model4, we add the superhost variable (host_is_superhost) and check whether superhosts can command a pricing premium after controlling for other variables.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.426e+00 3.529e-02 153.769 < 2e-16 ***
number_of_reviews -1.552e-04 8.479e-05 -1.831 0.06717 .
review_scores_rating 2.191e-02 6.998e-03 3.131 0.00175 **
room_typeHotel room 7.441e-01 4.859e-02 15.314 < 2e-16 ***
room_typePrivate room -4.824e-01 1.060e-02 -45.508 < 2e-16 ***
room_typeShared room -8.731e-01 4.952e-02 -17.631 < 2e-16 ***
bedrooms 1.733e-01 1.094e-02 15.840 < 2e-16 ***
beds 9.311e-03 5.867e-03 1.587 0.11255
accommodates -1.253e-01 5.385e-03 -23.272 < 2e-16 ***
host_is_superhostTRUE 1.028e-01 1.308e-02 7.866 4.09e-15 ***
Residual standard error: 0.4561 on 9048 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.2667, Adjusted R-squared: 0.266
F-statistic: 365.7 on 9 and 9048 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.186099 1 1.089082
review_scores_rating 1.022240 1 1.011059
room_type 1.260387 3 1.039323
bedrooms 2.280320 1 1.510073
beds 3.014382 1 1.736198
accommodates 3.933709 1 1.983358
host_is_superhost 1.190880 1 1.091274
Based on this model, superhosts charge a pricing premium, which can be seen from the positive coefficient and the high t-statistic. This makes sense, since these kinds of hosts are typically very professional in the way that they manage their apartments, which translates into higher customer value and thereby the ability to charge higher prices.
For Model5, we include the fact that some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4034933 0.0354634 152.368 < 2e-16 ***
number_of_reviews -0.0002050 0.0000851 -2.409 0.016037 *
review_scores_rating 0.0241972 0.0069981 3.458 0.000547 ***
room_typeHotel room 0.7049356 0.0489961 14.388 < 2e-16 ***
room_typePrivate room -0.4832893 0.0105825 -45.669 < 2e-16 ***
room_typeShared room -0.8717889 0.0494345 -17.635 < 2e-16 ***
bedrooms 0.1750736 0.0109243 16.026 < 2e-16 ***
beds 0.0092061 0.0058568 1.572 0.116021
accommodates -0.1274219 0.0053890 -23.645 < 2e-16 ***
host_is_superhostTRUE 0.0998504 0.0130642 7.643 2.34e-14 ***
instant_bookableTRUE 0.0589553 0.0104401 5.647 1.68e-08 ***
Residual standard error: 0.4553 on 9047 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.2693, Adjusted R-squared: 0.2685
F-statistic: 333.4 on 10 and 9047 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.198945 1 1.094963
review_scores_rating 1.025670 1 1.012754
room_type 1.286166 3 1.042836
bedrooms 2.282299 1 1.510728
beds 3.014412 1 1.736206
accommodates 3.952343 1 1.988050
host_is_superhost 1.192851 1 1.092177
instant_bookable 1.056667 1 1.027943
As can be seen from the summary statistics, the variable “instant_bookable” is also a significant predictor of price_4_nights. The regression analysis reveals that when controlling for the other listed variables, a rental with an instant-booking option is c. 6.07% more expensive than one without. The customer pays a premium for instant confirmation that the rental can be booked. The t-statistic for the variable is high and there is little multicollinearity with other variables, which is why it should be kept in the model.
For Model6, we look at neighbourhoods. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in the model. Instead, we manipulate the neighbourhood_group_cleansed variable and divide neighbourhoods into the following 4 groups:
City West: Steglitz - Zehlendorf, Spandau, Charlottenburg-Wilm.
City North: Reinickendorf, Pankow, Lichtenberg
City Central: Mitte, Friedrichshain-Kreuzberg
City East: Marzahn - Hellersdorf, Treptow - Köpenick, Neukölln, Tempelhof - Schöneberg
This grouping is based on (i) the geographic location of the neighbourhoods and (ii) the judgement of a Berlin local. It pays special consideration for the particularly sought-after quarters of “Mitte” and “Friedrichshain-Kreuzberg”, which create their own group.
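The grouping can be expressed as a simple lookup, sketched below. The district names are those listed above; `area_lookup` and the four-row data frame `toy` are hypothetical names introduced for illustration.

```r
# Lookup from district (neighbourhood_group_cleansed) to the four areas
area_lookup <- c(
  "Steglitz - Zehlendorf"    = "City West",
  "Spandau"                  = "City West",
  "Charlottenburg-Wilm."     = "City West",
  "Reinickendorf"            = "City North",
  "Pankow"                   = "City North",
  "Lichtenberg"              = "City North",
  "Mitte"                    = "City Central",
  "Friedrichshain-Kreuzberg" = "City Central",
  "Marzahn - Hellersdorf"    = "City East",
  "Treptow - Köpenick"       = "City East",
  "Neukölln"                 = "City East",
  "Tempelhof - Schöneberg"   = "City East"
)

# Hypothetical stand-in for the listings data
toy <- data.frame(
  neighbourhood_group_cleansed = c("Mitte", "Pankow", "Spandau",
                                   "Marzahn - Hellersdorf"),
  stringsAsFactors = FALSE
)

# "City Central" is the first level and therefore the base category in lm()
toy$areas <- factor(
  unname(area_lookup[toy$neighbourhood_group_cleansed]),
  levels = c("City Central", "City East", "City North", "City West")
)

toy
```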
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4698939 0.0354801 154.168 < 2e-16 ***
number_of_reviews -0.0002698 0.0000843 -3.201 0.001375 **
review_scores_rating 0.0240103 0.0069220 3.469 0.000525 ***
room_typeHotel room 0.6984246 0.0484656 14.411 < 2e-16 ***
room_typePrivate room -0.4802182 0.0104892 -45.782 < 2e-16 ***
room_typeShared room -0.9034713 0.0489465 -18.458 < 2e-16 ***
bedrooms 0.1780691 0.0108096 16.473 < 2e-16 ***
beds 0.0099778 0.0058002 1.720 0.085419 .
accommodates -0.1288824 0.0053389 -24.140 < 2e-16 ***
host_is_superhostTRUE 0.0997023 0.0129309 7.710 1.39e-14 ***
instant_bookableTRUE 0.0599617 0.0103274 5.806 6.61e-09 ***
areasCity East -0.1705299 0.0119988 -14.212 < 2e-16 ***
areasCity North -0.0903306 0.0125357 -7.206 6.23e-13 ***
areasCity West -0.0528050 0.0166278 -3.176 0.001500 **
Residual standard error: 0.4502 on 9044 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.2858, Adjusted R-squared: 0.2848
F-statistic: 278.5 on 13 and 9044 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.203308 1 1.096954
review_scores_rating 1.026409 1 1.013119
room_type 1.297248 3 1.044329
bedrooms 2.285705 1 1.511855
beds 3.023981 1 1.738960
accommodates 3.967810 1 1.991936
host_is_superhost 1.195353 1 1.093322
instant_bookable 1.057612 1 1.028403
areas 1.024096 3 1.003976
The regression table confirms that neighbourhood is indeed a significant driver of price. City Central is the base category for the analysis and is omitted in the model. Relative to this base case, all other areas are cheaper. For example, an Airbnb in City East will be c. 15.7% less expensive than a comparable apartment in City Central. Taking Berlin’s history into account, this makes logical sense. The eastern part of the city is the former GDR (DDR) part, where prices tend to be lower.
For Model7, we include the effect of availability_30 and reviews_per_month on price.
The variable “availability_30” is also a significant predictor of price_4_nights. The t-statistic is very high and the coefficient is positive, which means that, controlling for all the other variables, the impact of availability in the next month on price is positive.
The variable reviews_per_month does not seem to be a significant predictor, as its t-value is less than 2. This makes sense, since the number of reviews per month is not necessarily related to the quality of a property and therefore to its price. A cheap rental could equally well have a high number of reviews per month as a medium-priced or more expensive rental. Therefore, this variable is removed from the final version of model 7, along with “beds”, which also has a t-statistic <2 and is thereby not statistically significant.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.369e+00 3.320e-02 161.707 < 2e-16 ***
number_of_reviews -5.672e-04 9.376e-05 -6.049 1.51e-09 ***
review_scores_rating 4.052e-02 6.480e-03 6.253 4.20e-10 ***
room_typeHotel room 4.490e-01 4.576e-02 9.813 < 2e-16 ***
room_typePrivate room -4.839e-01 9.807e-03 -49.345 < 2e-16 ***
room_typeShared room -1.112e+00 4.605e-02 -24.143 < 2e-16 ***
bedrooms 1.931e-01 1.010e-02 19.128 < 2e-16 ***
beds 2.500e-03 5.416e-03 0.462 0.644
accommodates -1.427e-01 4.996e-03 -28.574 < 2e-16 ***
host_is_superhostTRUE 9.383e-02 1.210e-02 7.756 9.74e-15 ***
instant_bookableTRUE 2.094e-02 9.767e-03 2.144 0.032 *
reviews_per_month 5.388e-03 3.988e-03 1.351 0.177
availability_30 2.342e-02 6.507e-04 36.000 < 2e-16 ***
areasCity East -1.698e-01 1.119e-02 -15.173 < 2e-16 ***
areasCity North -8.749e-02 1.169e-02 -7.483 7.94e-14 ***
areasCity West -9.605e-02 1.555e-02 -6.176 6.85e-10 ***
Residual standard error: 0.4198 on 9042 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.379, Adjusted R-squared: 0.378
F-statistic: 368 on 15 and 9042 DF, p-value: < 2.2e-16
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.368e+00 3.316e-02 161.911 < 2e-16 ***
number_of_reviews -4.976e-04 7.882e-05 -6.313 2.86e-10 ***
review_scores_rating 4.093e-02 6.471e-03 6.325 2.65e-10 ***
room_typeHotel room 4.461e-01 4.571e-02 9.758 < 2e-16 ***
room_typePrivate room -4.847e-01 9.779e-03 -49.566 < 2e-16 ***
room_typeShared room -1.111e+00 4.529e-02 -24.535 < 2e-16 ***
bedrooms 1.932e-01 9.989e-03 19.337 < 2e-16 ***
accommodates -1.412e-01 3.926e-03 -35.977 < 2e-16 ***
host_is_superhostTRUE 9.515e-02 1.206e-02 7.890 3.36e-15 ***
instant_bookableTRUE 2.264e-02 9.684e-03 2.338 0.0194 *
availability_30 2.359e-02 6.400e-04 36.857 < 2e-16 ***
areasCity East -1.701e-01 1.119e-02 -15.205 < 2e-16 ***
areasCity North -8.756e-02 1.169e-02 -7.490 7.53e-14 ***
areasCity West -9.546e-02 1.553e-02 -6.147 8.23e-10 ***
Residual standard error: 0.4198 on 9044 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.3789, Adjusted R-squared: 0.378
F-statistic: 424.4 on 13 and 9044 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.209670 1 1.099850
review_scores_rating 1.031352 1 1.015555
room_type 1.304914 3 1.045355
bedrooms 2.244379 1 1.498125
accommodates 2.466464 1 1.570498
host_is_superhost 1.195437 1 1.093360
instant_bookable 1.069316 1 1.034078
availability_30 1.121507 1 1.059012
areas 1.027859 3 1.004590
As the summary statistics above indicate, our final model comprises statistically significant variables only and has an adjusted R-squared value of 0.378. This means that the model helps to explain 37.8% of the variation in the log-transformed price. At first, this seems like a mediocre model, since almost 2/3 of the variation in price remains unexplained. However, given that rental prices depend heavily on the specific location (as opposed to the mere neighborhood), the quality of the amenities, the last date of redevelopment, and many other factors, we consider an R-squared of almost 40% satisfactory. For instance, the addition of the simplified neighborhood variable only added c. 2 percentage points of explanatory power in our analysis, and we believe that a future investigation should put more emphasis on this variable and possibly consider factors such as “distance to public transport” or “distance to airport”.
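The adjusted value can be checked from the reported figures for the final model, using the standard definition with \(n = 9058\) observations and \(p = 13\) estimated slope coefficients (the symbols \(n\) and \(p\) are introduced here for illustration):

```latex
\[
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
          = 1 - \left(1 - 0.3789\right)\frac{9057}{9044}
          \approx 0.3780
\]
```

Because the sample is large relative to the number of predictors, the adjustment is tiny, and R-squared and adjusted R-squared are almost identical.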
In addition to looking at explanatory power, we should also analyze our model using RMSE. This analysis reveals whether the model actually works on unseen data, or whether it is overfitted to the specifics of the training data. The analysis below suggests that model 7 generalizes well, based on two things. First, the rmse_train value is small (0.4193 on the log scale), which means that predicted and actual values are reasonably close. Second, the difference between rmse_train and rmse_test is small (about 0.001), which means that the model is not overfitted to the training set and has similar predictive power on new data.
[1] 0.4193265
[1] 0.4200928
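A train/test comparison of this kind can be sketched in base R as follows. The data are simulated, the 75/25 split is an assumption, and `rmse` is a helper defined here; the numbers will therefore differ from the two values printed above.

```r
# Simulated stand-in data on the log-price scale (hypothetical values)
set.seed(123)
n <- 1000
toy <- data.frame(
  bedrooms = sample(1:4, n, replace = TRUE),
  superhost = rbinom(n, 1, 0.2) == 1
)
toy$log_price <- 5 + 0.2 * toy$bedrooms + 0.1 * toy$superhost +
  rnorm(n, sd = 0.4)

# Assumed 75/25 split into training and test data
train_idx <- sample(seq_len(n), size = 0.75 * n)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

fit <- lm(log_price ~ bedrooms + superhost, data = train)

# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$log_price, predict(fit, newdata = train))
rmse_test <- rmse(test$log_price, predict(fit, newdata = test))
c(train = rmse_train, test = rmse_test)
```

A test RMSE that is much larger than the training RMSE would point to overfitting; similar values, as in the analysis above, suggest that the model generalizes.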
In a future study, it would be interesting to apply the same model to different cities and test how it performs there. One can hypothesize that in different regions of the world, some variables may have a particularly strong effect on price. For example, in regions that are less safe or more heterogeneous than Berlin, the neighborhood variable may be of greater significance. In this case, the RMSE would reveal that the model must be adapted, because the accuracy on the test data set would be a lot lower than on the training data.
To provide an overview of the models that we worked with, we can create a summary table of the important parameters. From this table, we can see that between model 2 and 3, as well as between model 6 and 7, we could increase the explanatory power significantly.
| names | model1 | model2 | model3 | model4 | model5 | model6 | model7 |
|---|---|---|---|---|---|---|---|
| (Intercept) | 5.28229830751976 | 5.22714373833038 | 5.40399208243834 | 5.42645420243141 | 5.40349327285124 | 5.46989389255496 | 5.36842999161987 |
| | (0.0449472631482688) | (0.0321995916350208) | (0.0353259904742744) | (0.035289591206124) | (0.0354633763379476) | (0.0354800637557067) | (0.0331566415080243) |
| prop_type_simplifiedEntire rental unit | -0.142798334829136 | | | | | | |
| | (0.0318703038020704) | | | | | | |
| prop_type_simplifiedEntire serviced apartment | 0.419892547407141 | | | | | | |
| | (0.0464712067554619) | | | | | | |
| prop_type_simplifiedOther | -0.140866464198057 | | | | | | |
| | (0.0342008917886018) | | | | | | |
| prop_type_simplifiedPrivate room in rental unit | -0.536530137341479 | | | | | | |
| | (0.0321607106019711) | | | | | | |
| number_of_reviews | -0.000259436899185359 | -0.000213135937103914 | 8.84671142095089e-05 | -0.000155228345967399 | -0.000204973080393352 | -0.000269826267833803 | -0.000497645902528479 |
| | (8.04033957299506e-05) | (7.92414504752674e-05) | (7.94353786233337e-05) | (8.47907109431558e-05) | (8.51034925945332e-05) | (8.4300180803711e-05) | (7.88238495811527e-05) |
| review_scores_rating | 0.0426308955354075 | 0.0320211919667921 | 0.0283494448486826 | 0.0219118282533087 | 0.024197202524921 | 0.0240102896753886 | 0.0409268479410226 |
| | (0.00688315174779722) | (0.00679192046592831) | (0.00697744574452695) | (0.00699833585033276) | (0.00699813281715178) | (0.00692198931760884) | (0.00647080540437282) |
| room_typeHotel room | | 0.780634988210945 | 0.752782220011443 | 0.744057667615733 | 0.704935628113934 | 0.69842463940715 | 0.446063814864002 |
| | | (0.0506301012620254) | (0.0487903298450189) | (0.0485865117668741) | (0.0489960635143676) | (0.0484656478641766) | (0.0457115231882595) |
| room_typePrivate room | | -0.39559293692107 | -0.481592413788568 | -0.482358095085801 | -0.483289284572908 | -0.480218241201602 | -0.484720607704976 |
| | | (0.00995839000720386) | (0.0106400862010558) | (0.0105993130247822) | (0.0105825518896458) | (0.0104891694754441) | (0.00977928191805773) |
| room_typeShared room | | -1.03813268110591 | -0.896742508228079 | -0.873073776663116 | -0.871788918955122 | -0.903471308487009 | -1.11125474767876 |
| | | (0.0503848048096999) | (0.0496395550413163) | (0.0495183051346452) | (0.0494345208281935) | (0.0489465133659055) | (0.0452931953205682) |
| bedrooms | | | 0.172862216540498 | 0.173257083385276 | 0.175073569908261 | 0.178069114712692 | 0.193160289603151 |
| | | | (0.0109846735623061) | (0.0109381654628655) | (0.0109242795652863) | (0.01080958068873) | (0.00998920781170885) |
| beds | | | 0.00960474201285841 | 0.00931063760091393 | 0.00920604533187131 | 0.00997785257783069 | |
| | | | (0.00589230756540298) | (0.00586678463610014) | (0.00585682536499641) | (0.0058001968671528) | |
| accommodates | | | -0.123896408268763 | -0.12533236208788 | -0.127421867291505 | -0.128882438569403 | -0.141229135391807 |
| | | | (0.00540513220772034) | (0.0053854661040724) | (0.00538901527427583) | (0.0053388760786438) | (0.00392550899950059) |
| host_is_superhostTRUE | | | | 0.102849392035972 | 0.099850366785892 | 0.099702303431099 | 0.0951543855253531 |
| | | | | (0.0130756147859674) | (0.0130641520598685) | (0.0129308906077658) | (0.012059460924183) |
| instant_bookableTRUE | | | | | 0.0589552819362381 | 0.0599616924652385 | 0.0226432049678445 |
| | | | | | (0.0104401401993214) | (0.0103274425546466) | (0.00968426946438939) |
| areasCity East | | | | | | -0.170529854770176 | -0.170126099141686 |
| | | | | | | (0.0119988140206354) | (0.0111890528020354) |
| areasCity North | | | | | | -0.0903306430346683 | -0.08756014347055 |
| | | | | | | (0.0125357427913606) | (0.011690098376889) |
| areasCity West | | | | | | -0.0528049478556697 | -0.0954560273732202 |
| | | | | | | (0.0166278069895657) | (0.0155287451989084) |
| availability_30 | | | | | | | 0.0235899068419489 |
| | | | | | | | (0.000640043770630418) |
| #observations | 9873 | 9873 | 9067 | 9058 | 9058 | 9058 | 9058 |
| R squared | 0.164794439590278 | 0.187279349472108 | 0.261146609635777 | 0.266707409823868 | 0.269283003785776 | 0.285849519718787 | 0.37890488910525 |
| Adj. R Squared | 0.164286509997489 | 0.186867511704535 | 0.260494056409577 | 0.265978007380059 | 0.268475313948024 | 0.284822987626388 | 0.378012116389457 |
| Residual SE | 0.488228611782495 | 0.481587468287354 | 0.458115518945514 | 0.456089758685603 | 0.45531323809596 | 0.450196959524332 | 0.419842835622159 |
We now apply the following criteria to find our target listings:
The review score is higher than 90% of the full score
The number of reviews is larger than 10
It has a private room
The host identity is verified and the host is a superhost
It is in the neighborhood of Friedrichshain-Kreuzberg
There are two beds
We believe the best model is model 7, as it has the highest adjusted R-squared and the lowest residual SE. We apply model 7 to predict the price, together with a lower and an upper bound for each prediction. Based on the output below, we can see that the predicted prices range from c. 120€ to 310€. However, the “lwr” and “upr” columns tell us that the spread for each estimated price is extremely high. This should not come as a surprise, since the explanatory power of our model is limited to less than 40%. In the next chapter, we briefly discuss options on how to improve on this in a future analysis.
| fit | lwr | upr |
|---|---|---|
| 150.3410 | 65.94268 | 342.7586 |
| 131.3156 | 57.63560 | 299.1862 |
| 120.3012 | 52.79845 | 274.1063 |
| 141.5224 | 62.08586 | 322.5952 |
| 139.9085 | 61.40755 | 318.7621 |
| 239.3516 | 105.02540 | 545.4796 |
| 161.0301 | 70.67973 | 366.8762 |
| 129.5751 | 56.86532 | 295.2539 |
| 132.7859 | 58.28334 | 302.5236 |
| 162.5664 | 71.35089 | 370.3926 |
| 141.9125 | 62.28504 | 323.3386 |
| 141.0033 | 61.88571 | 321.2685 |
| 160.6203 | 70.49197 | 365.9832 |
| 310.4774 | 136.19860 | 707.7623 |
| 130.9168 | 57.46027 | 298.2791 |
| 160.7136 | 70.54209 | 366.1484 |
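In every row of the table above, upr/fit and fit/lwr are both roughly 2.28 (e.g., 342.7586 / 150.3410 ≈ 150.3410 / 65.94268 ≈ 2.28). This is what one would expect from a 95% prediction interval computed on a log-price scale and then exponentiated, since exp(1.96 × 0.42) ≈ 2.28 with the residual SE of model 7. The sketch below illustrates that mechanism on simulated data. It is an assumption about how such a table can be produced, not the code behind the table; the data, variable names, and coefficients are purely illustrative.

```r
set.seed(42)

# Simulated stand-in data; names and coefficients are illustrative assumptions.
n <- 200
toy <- data.frame(
  bedrooms = sample(1:3, n, replace = TRUE),
  availability_30 = sample(0:30, n, replace = TRUE)
)
toy$price <- exp(4 + 0.2 * toy$bedrooms + 0.02 * toy$availability_30 +
                   rnorm(n, sd = 0.4))

# Fit the regression on the log of price.
toy_model <- lm(log(price) ~ bedrooms + availability_30, data = toy)

# Hypothetical listings we want a price estimate for.
new_listings <- data.frame(bedrooms = c(1, 2), availability_30 = c(5, 20))

# Prediction interval on the log scale (columns: fit, lwr, upr) ...
log_pred <- predict(toy_model, newdata = new_listings, interval = "prediction")

# ... transformed back to the price scale.
price_pred <- exp(log_pred)
price_pred
```

Because the interval is symmetric around the fitted value on the log scale, it becomes multiplicative on the price scale: the upper bound is the fit times some factor, and the lower bound is the fit divided by the same factor. A residual SE of about 0.42 therefore translates into bounds of roughly half and double the point estimate.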
In this final section, we summarize the results of our selected model and discuss possible steps that could further improve the analysis.
As mentioned in the introduction, the overall goal of this analysis was to find a set of variables that would help us to predict Airbnb rental prices in Berlin. Our final model defines the following variables as statistically significant drivers of said rental prices:
The coefficients for each of these variables tell us how rental prices are impacted. For instance, the 1.9 coefficient for “Nr. of bedrooms” tells us that rental prices tend to increase with a higher nr. of bedrooms. The standard error for each of the coefficient estimates provides us with an idea of how far we are away from the “true” value of the coefficient. Whenever the ratio of coefficient to standard error (the t-statistic) is larger than 2 in absolute value, we can be relatively sure that the variable is in fact statistically significant, roughly at the 5% level. In our model, this is the case for all variables. The t-statistic with the closest value to 2 is the one for instant bookability (c. 2.338), which means that even this variable is statistically significant at the 5% level.
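As a worked illustration of this rule of thumb, the t-statistic is simply the coefficient estimate divided by its standard error. The numbers below are taken from the availability_30 row of the regression table, on the assumption that the bracketed value underneath the coefficient is its standard error:

$$
t = \frac{\hat{\beta}}{\operatorname{SE}(\hat{\beta})} = \frac{0.02359}{0.00064} \approx 36.9
$$

This is far above 2. By comparison, the t-statistic of c. 2.338 for instant bookability corresponds to a two-sided p-value of roughly 0.02 under a normal approximation, which is still below the 5% threshold.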
If we look at the p-value of the overall model (c. 2.2 × 10^-16), we notice that this is extremely small. This means that we can reject, with near certainty, the hypothesis that none of our variables is related to rental prices. As previously stated, the adjusted R-squared (adjusted for the nr. of variables) tells us how much of the variation in price can be explained; at 37.8%, this is a substantial share, although most of the variation remains unexplained.
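For reference, the adjustment for the number of variables follows the standard formula below, where n is the number of observations and p the number of estimated slope coefficients. The value p = 13 is not stated in this report; it is an assumption back-calculated from the R squared, adjusted R squared, and n = 9058 reported for model 7 in the regression table:

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.3789)\cdot\frac{9057}{9044} \approx 0.3780
$$

With more than 9,000 observations the penalty is tiny, which is why the adjusted and unadjusted values are almost identical for all seven models.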
Our RMSE analysis also showed us that the model works well on different subgroups of the data. In a further analysis, it would be interesting to apply the model to other cities as well and compare how the explanatory power changes and whether any variable becomes an insignificant predictor of price.
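One common way to run such an RMSE check is to fit the model on a training split and compare the error on held-out data. The following sketch uses simulated data and hypothetical variable names; it illustrates the general technique and is not necessarily the exact procedure used earlier in this report.

```r
set.seed(1)

# Simulated stand-in data; names and coefficients are illustrative assumptions.
n <- 500
toy <- data.frame(bedrooms = sample(1:3, n, replace = TRUE))
toy$log_price <- 4 + 0.2 * toy$bedrooms + rnorm(n, sd = 0.4)

# Random 75/25 split into training and test sets.
train_idx <- sample(seq_len(n), size = 0.75 * n)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

fit <- lm(log_price ~ bedrooms, data = train)

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$log_price, predict(fit, newdata = train))
rmse_test <- rmse(test$log_price, predict(fit, newdata = test))

c(train = rmse_train, test = rmse_test)
```

If the two values are close, the model performs similarly on data it has not seen, which suggests that it is not overfitting the training set.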
Additionally, the completeness of the data set could be improved in further analysis. We had to leave out thousands of rentals because they lacked values for our chosen predictor variables.
Finally, we recommend being aware of the impact of seasonality and weekday on prices. There are certainly some seasons in which demand for rentals is particularly high (e.g., on national holidays or during summertime). The same goes for certain days of the week (e.g., the weekend being in higher demand than weekdays). In a next analysis, we would therefore like to focus on the impact of these time-related variables on the variation in price.
The data for this project is from insideairbnb.com.